R offers extensive built-in and add-on visualization capabilities.
The following sections explore the base plot() function, the ggplot2 package, and animated maps built with the magick package.
The plot() function is one of the most basic functions in the R programming language. It visualizes data in 2D as graphs and charts and helps reveal correlation among data variables. Common graphs produced with this function include scatter plots and line graphs (EDUCBA, 2020).
The generic syntax for the plot() function is:
plot(x, y, ...)
A more detailed call sets additional arguments:
plot(x, y, type, main, sub, xlab, ylab)
The type argument selects the kind of plot:
“p”: points
“l”: lines
“b”: both points and lines
“c”: the lines part alone of “b” (gaps are left where the points would be)
“o”: both points and lines, overplotted
“h”: histogram-like vertical lines
“s”: stair steps
“n”: no plotting
The remaining arguments set the text around the plot:
main: text for the main title
sub: text for the sub-title (under the x-axis)
xlab: label for the x-axis
ylab: label for the y-axis
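To compare the type values side by side, the sketch below draws the same vector once per type in a grid of panels. The vector grades and the title and label texts are made-up illustration values, not data from this article.

```r
# Hypothetical illustration data (not from the article)
grades <- c(12, 25, 18, 30, 22, 35)

# The type values listed above
types <- c("p", "l", "b", "c", "o", "h", "s", "n")

# Arrange the plots in a 2 x 4 grid and remember the previous settings
old_par <- par(mfrow = c(2, 4))
for (tp in types) {
  plot(grades,
       type = tp,
       main = paste0('type = "', tp, '"'),   # main title
       sub  = "same data, different type",   # sub-title under the x-axis
       xlab = "Index",                       # x-axis label
       ylab = "Value")                       # y-axis label
}
# Restore the previous layout
par(old_par)
```

The last panel, drawn with type "n", shows only the axes and titles, which is useful when points or lines are added afterwards.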
Suppose 10 students take two different courses, and we have their grades from a recent exam.
The X variable denotes the first course and the Y variable denotes the second course.
X = 40, 15, 50, 12, 22, 29, 21, 35, 14, 15
Y = 41, 42, 32, 14, 42, 27, 13, 50, 33, 22
Put them into the correct syntax and create a line plot using type “l”.
First, define X as a vector, then pass it to the plot function with type “l” to draw a line plot.
X <- c(40, 15, 50, 12, 22, 29, 21, 35, 14, 15)
plot(X, type = "l")
Next, define Y as a vector and pass it to the plot function with type “p” to draw a points plot.
Y <- c(41, 42, 32, 14, 42, 27, 13, 50, 33, 22)
plot(Y, type = "p")
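Since plot() can also show how two variables relate, a natural next step is to plot the two courses against each other. The sketch below reuses the X and Y grades from this example; the title and axis label texts are illustrative choices.

```r
# Grades of the 10 students in the two courses (from the example above)
X <- c(40, 15, 50, 12, 22, 29, 21, 35, 14, 15)
Y <- c(41, 42, 32, 14, 42, 27, 13, 50, 33, 22)

# Scatter plot: one point per student
plot(X, Y,
     type = "p",
     main = "Exam grades in two courses",
     sub  = "one point per student",
     xlab = "Grade in first course",
     ylab = "Grade in second course")

# Pearson correlation coefficient between the two sets of grades
r <- cor(X, Y)
print(r)
```

A value of r near 1 or -1 would indicate a strong linear relationship between the two sets of grades, while a value near 0 would indicate a weak one.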
There is a plethora of ways to use the basic plot() function to your advantage.
Below is a sample of other visualization capabilities within the default plot function.
Source: Kumar, 2020
Case Study Source: Qian, 2020
The nCov2019 package gives access to the latest and historical case data for all countries, plots the data on a map, and creates various graphs.
To use the data, we first install the package from GitHub.
Package: nCov2019
By: Dr. Guangchuang Yu (Southern Medical University)
remotes::install_github("GuangchuangYu/nCov2019")
library(nCov2019)
get_nCov2019()
load_nCov2019()
Then we load the packages necessary for visualization. Note that require() returns FALSE with a warning, rather than stopping, if a package is not installed.
require(nCov2019)
require(dplyr)
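To check explicitly which packages are installed before loading them, one option is requireNamespace(), which reports TRUE or FALSE without attaching anything. The sketch below is one possible check, not part of the original walkthrough; the list of package names is taken from the packages used in this article.

```r
# Packages used in this walkthrough
needed <- c("nCov2019", "dplyr", "ggplot2", "ggrepel", "magick")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Names of any packages that still need to be installed
missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```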
Now we can get a first impression of the dataset.
Despite the similar names, these are not the base R get() and load() functions: get_nCov2019() and load_nCov2019() are provided by the nCov2019 package.
Assigning the result of get_nCov2019() to x triggers a download of the latest COVID-19 statistics.
Assigning the result of load_nCov2019() to y loads the historical COVID-19 data.
x <- get_nCov2019()
y <- load_nCov2019()
We can then check the results for x and y. Printing x shows the total number of confirmed cases in China and the time of the last update, and printing y shows when the historical data was last updated. Printing x and y will also refresh the data.
x
China (total confirmed cases): 95901
last update: 2020-12-21 20:45:32
y
nCov2019 historical data
last update: 2020-11-26
We can also check worldwide statistics easily for details on confirmed cases, deaths, etc.
Indexing x with 'global' returns the entire data set sorted by number of confirmed cases, with China listed first.
x['global']
             name  confirm suspect   dead deadRate showRate     heal
 1          China    95901       7   4771     4.97    FALSE    89480
 2  United States 18277433       0 324898     1.78    FALSE 10622096
 3          India 10055560       0 145810     1.45    FALSE  9606111
 4         Brazil  7238600       0 186764     2.58    FALSE  6409986
 5         Russia  2850042       0  50723     1.78    FALSE  2273510
 6         France  2529756       0  60665      2.4    FALSE   189638
 7 United Kingdom  2079564       0  67718     3.26    FALSE     4380
 8         Turkey  2043704       0  18351      0.9    FALSE  1834705
 9          Italy  1964054       0  69214     3.52    FALSE  1281258
10          Spain  1817448       0  48926     2.69    FALSE   196958
11      Argentina  1541285       0  41813     2.71    FALSE  1368346
12        Germany  1531998       0  26655     1.74    FALSE  1129280
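As a sanity check on the columns, the deadRate values look consistent with deaths divided by confirmed cases, expressed as a percentage. That relationship is an assumption inferred from the printed numbers, not documented behavior of the package. The sketch below checks it for the first three rows, typed in by hand from the output above.

```r
# First three rows of the printed output, typed in by hand
top <- data.frame(
  name    = c("China", "United States", "India"),
  confirm = c(95901, 18277433, 10055560),
  dead    = c(4771, 324898, 145810),
  stringsAsFactors = FALSE
)

# ASSUMED formula: deaths as a percentage of confirmed cases, 2 decimals
top$deadRate <- round(top$dead / top$confirm * 100, 2)
print(top)
```

For these three rows the computed values are 4.97, 1.78 and 1.45, which match the deadRate column in the output.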
The plot() function is very versatile and can also visualize data on a map.
Since x already holds the downloaded data, we can pass it to plot() to get a static heat map.
plot(x)
Source: Self-made graphic on R Studio
Unlike the plot() function that ships with R by default, the ggplot2 package has to be installed separately. It lets you create a visualization by telling ggplot2 how you want to map variables to aesthetics. The ‘gg’ in the name stands for ‘Grammar of Graphics’ (Pedersen, 2020).
ggplot2 makes it simple to create complex plots from data in a data frame.
It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot.
This helps in creating publication quality plots with minimal amounts of adjustments and tweaking (Holtz, 2020).
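A minimal sketch of that idea follows, assuming ggplot2 is installed. It reuses the exam grades from the plot() example; the data frame layout and the object names base, bar_chart and scatter_chart are illustrative choices. The variables are mapped to aesthetics once, and switching from a bar plot to a scatterplot only changes the geom.

```r
library(ggplot2)

# Exam grades of the 10 students from the earlier plot() example
grades <- data.frame(
  student = 1:10,
  course1 = c(40, 15, 50, 12, 22, 29, 21, 35, 14, 15),
  course2 = c(41, 42, 32, 14, 42, 27, 13, 50, 33, 22)
)

# The mapping of variables to aesthetics is declared once
base <- ggplot(grades, aes(x = student, y = course1)) +
  labs(title = "Grades in the first course", x = "Student", y = "Grade")

# Swapping the geom changes the chart type; data and mapping stay the same
bar_chart     <- base + geom_col()
scatter_chart <- base + geom_point()

print(bar_chart)
print(scatter_chart)
```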
Below are some examples of the capabilities of ggplot2. Compared with plot(), its aesthetic and programmability capabilities are much more advanced.
Source: Holtz, 2020
Based on the same data as Case Study 1, it is possible to extract the top 10 countries by confirmed cases and plot them with ggplot2.
# obtain the top 10 countries
d <- y['global'] # extract global data
d <- d[d$country != 'China',] # exclude China
n <- d %>% filter(time == time(y)) %>%
  top_n(10, cum_confirm) %>%
  arrange(desc(cum_confirm))
# plot the top 10 from Feb 01 to the most recent date in the dataset
require(ggplot2)
require(ggrepel)
ggplot(filter(d, country %in% n$country, time > '2020-02-01'),
       aes(time, cum_confirm, color = country)) +
  geom_line() +
  # label each line once, at the most recent date
  geom_text_repel(aes(label = country),
                  data = function(d) d[d$time == time(y), ])
This results in a colorful, easy-to-read line graph like the one below.
This graph shows the top 10 countries with the most confirmed cases outside of China. The United States, India and Brazil are the top 3 most infected countries and show exponential growth. Meanwhile, the 7 other countries under Brazil have flattened their curve and slowed down the growth with an effective containment strategy. Some other European countries such as France and Italy are also seeing hundreds and thousands of new cases.
Source: Self-made graphic on R Studio
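The top-N selection step can be tried on a small scale without downloading anything. The sketch below assumes dplyr is installed and uses a hand-typed data frame holding confirmed-case counts from the global table printed in Case Study 1. The column names name and confirm follow that printed table (the historical data uses country and cum_confirm instead), and the object names latest and top3 are illustrative.

```r
library(dplyr)

# Confirmed cases copied by hand from the global table shown earlier
latest <- data.frame(
  name    = c("China", "United States", "India", "Brazil", "Russia", "France"),
  confirm = c(95901, 18277433, 10055560, 7238600, 2850042, 2529756),
  stringsAsFactors = FALSE
)

top3 <- latest %>%
  filter(name != "China") %>%   # exclude China
  top_n(3, confirm) %>%         # keep the 3 largest values of confirm
  arrange(desc(confirm))        # largest first

print(top3$name)
```

The three names returned are the United States, India and Brazil, in that order, which agrees with the top 3 described for the graph.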
Below is a gauge-type plot made with ggplot2.
Quartiles and extreme percentiles of a continuous distribution are often displayed with box plots. However, box plots are a very ‘statistical’ diagram and can be hard to read without background knowledge, whereas gauge plots are much easier to understand and are commonly used in dashboard applications.
Using 181 publicly reported confirmed cases in China, one modeling study estimated that symptoms of COVID-19 would develop in 2.5% of infected individuals within 2.2 days and in 97.5% of infected individuals within 11.5 days. Nearly all infected individuals show symptoms within 14 days, which explains why the quarantine periods required in many countries are between 10 and 14 days (Kanevsky, 2020).
Source: Kanevsky, 2020
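To see how the two reported percentiles relate to a 14-day window, one can fit a simple distribution through them. The sketch below assumes the incubation time is log-normally distributed; this is a modeling assumption made here for illustration and is not stated in the text above. The names q_low, q_high, meanlog, sdlog, median_days and share_14 are illustrative.

```r
# Percentiles reported in the text (in days)
q_low  <- 2.2    # 2.5% of infected develop symptoms within this time
q_high <- 11.5   # 97.5% of infected develop symptoms within this time

# ASSUMPTION: incubation time is log-normal, so log(time) is normal and the
# two percentiles sit symmetrically around the mean on the log scale
z       <- qnorm(0.975)
meanlog <- (log(q_low) + log(q_high)) / 2
sdlog   <- (log(q_high) - log(q_low)) / (2 * z)

# Median implied by the assumption, and share with symptoms within 14 days
median_days <- exp(meanlog)
share_14    <- plnorm(14, meanlog = meanlog, sdlog = sdlog)

print(median_days)
print(share_14)
```

Because 14 days lies beyond the 97.5th percentile of 11.5 days, share_14 is above 97.5% under this assumption, which is consistent with quarantine periods of up to 14 days.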
The spectrum of illness severity ranges from mild to critical and most infections are not severe. In analyzing a study involving 44,500 confirmed cases conducted by the Chinese Center for Disease Control and Prevention, the estimation of the infection severity was reported as the following (Kanevsky, 2020):
- 81% mild (no or mild pneumonia)
- 14% severe (with dyspnea (shortness of breath) or hypoxia (low blood oxygen levels))
- 5% critical (respiratory failure, shock, or multi-organ dysfunction)
- 2.3% Case Fatality Rate (CFR)
There were no deaths reported in non-critical cases.
Source: Kanevsky, 2020
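The severity shares listed above can be drawn as a single stacked bar, which reads much like a gauge. This is a minimal sketch assuming ggplot2 is installed; it is not the code behind the original chart, and the object names severity and p and the label texts are illustrative.

```r
library(ggplot2)

# Severity shares reported in the text (percent of confirmed cases)
severity <- data.frame(
  level   = factor(c("Mild", "Severe", "Critical"),
                   levels = c("Mild", "Severe", "Critical")),
  percent = c(81, 14, 5)
)

# A single stacked bar: each segment's length is its share of cases
p <- ggplot(severity, aes(x = "Confirmed cases", y = percent, fill = level)) +
  geom_col(width = 0.5) +
  coord_flip() +
  labs(title = "COVID-19 severity among confirmed cases",
       x = NULL, y = "Percent of cases", fill = "Severity")

print(p)
```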
The most frequent manifestation of infection is pneumonia, which can be characterized by symptoms such as fever, cough, dyspnea (shortness of breath), and bilateral infiltrates on chest imaging. An analysis of 138 pneumonia patients in Wuhan, China found the following to be the most common clinical features at the onset of illness (Kanevsky, 2020):
- Fever 99%
- Fatigue 70%
- Dry cough 59%
- Anorexia 40%
- Myalgias (muscle aches) 35%
- Dyspnea (shortness of breath) 31%
- Sputum production (coughing up and spitting out the material produced in the respiratory tract) 27%
Source: Kanevsky, 2020
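A sorted horizontal bar chart is one straightforward way to present percentages like these. The sketch below assumes ggplot2 is installed and uses the symptom figures quoted above; it is an illustration and not the code behind the original chart, and the object names symptoms and p are illustrative.

```r
library(ggplot2)

# Clinical features at onset of illness, as quoted in the text (percent)
symptoms <- data.frame(
  feature = c("Fever", "Fatigue", "Dry cough", "Anorexia",
              "Myalgias", "Dyspnea", "Sputum production"),
  percent = c(99, 70, 59, 40, 35, 31, 27),
  stringsAsFactors = FALSE
)

# reorder() sorts the bars by percent; coord_flip() makes them horizontal
p <- ggplot(symptoms, aes(x = reorder(feature, percent), y = percent)) +
  geom_col() +
  geom_text(aes(label = paste0(percent, "%")), hjust = -0.1) +
  coord_flip() +
  ylim(0, 110) +   # leave room for the text labels
  labs(title = "Clinical features at onset of illness",
       x = NULL, y = "Percent of patients")

print(p)
```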
The Case Fatality Rate (CFR) ranged from 5.8% in Wuhan to 0.7% across the rest of China. Most fatal cases involved patients of advanced age or with preexisting conditions. The proportion of severe or fatal infections may vary by location. For example, in Italy, 12% of all COVID-19 infected persons and 16% of all hospitalized patients were admitted to the ICU, and in mid-March the estimated CFR there was 7.2%. The median age of infected patients in Italy was 64 years. In comparison, the CFR in South Korea in mid-March was 0.9% and the median age was in the 40s (Kanevsky, 2020).
Source: Kanevsky, 2020
The exact interval during which a COVID-19 patient is infectious is uncertain. Much of the data on the infectious interval comes from studies that detect viral RNA in respiratory and other specimens; however, detection of viral RNA does not necessarily mean that infectious virus is present (Kanevsky, 2020).
Based on a series of studies, viral RNA levels appear to be higher soon after the onset of infection than later on. Transmission may therefore be more likely in the earlier stage of infection than in the later stages (Kanevsky, 2020).
The duration of viral shedding also varies, with a wide range depending on the severity of the illness. In one study of 21 patients with mild illness, 90% repeatedly tested negative for viral RNA on nasal swabs by 10 days after the onset of symptoms, while tests stayed positive longer in patients with more severe illness. In another study, the median duration of viral shedding among 137 patients who had recovered from COVID-19 was 20 days (Kanevsky, 2020).
Source: Kanevsky, 2020
Rather than a static map, maps can also be visualized over time in dynamic form with the magick R package.
Using the same variables set previously, it is possible to create a moving heat map.
The moving heat map below shows the development of COVID-19 from February 1st, 2020 to March 31st, 2020.
It shows the virus originating in China and spreading across the world.
install.packages("magick")
library(magick)
require(nCov2019)
x <- get_nCov2019()
y <- load_nCov2019()
# every day from 2020-02-01 to 2020-03-31 as zero-padded "YYYY-MM-DD" strings
d <- format(seq(as.Date("2020-02-01"), as.Date("2020-03-31"), by = "day"))
# open a graphics device that records each plot as one frame
img <- image_graph(1200, 700, res = 96)
out <- lapply(d, function(date){
  p <- plot(y, date = date,
            label = FALSE, continuous_scale = TRUE)
  print(p)
})
dev.off()
# play the frames back at 2 frames per second
animation <- image_animate(img, fps = 2)
print(animation)
Source: Self-made graphic on R Studio
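The same frame-by-frame technique can be tried without the COVID-19 data. The sketch below assumes the magick package is installed and animates a few base R plots of made-up values. The numbers plotted, the frame count, and the names img, animation and out_file are illustrative.

```r
library(magick)

# Open a magick graphics device; every new plot becomes one frame
img <- image_graph(width = 400, height = 300, res = 96)
for (n in 2:6) {
  # Made-up illustration data: the first n square numbers
  plot(seq_len(n), seq_len(n)^2,
       type = "b",
       xlim = c(1, 6), ylim = c(1, 36),   # fixed axes keep the frames aligned
       main = paste("Frame", n - 1),
       xlab = "Step", ylab = "Value")
}
dev.off()

# Combine the recorded frames into an animation at 2 frames per second
animation <- image_animate(img, fps = 2)

# Write the animation to a temporary GIF file
out_file <- tempfile(fileext = ".gif")
image_write(animation, path = out_file, format = "gif")
```

Keeping xlim and ylim fixed across frames is a deliberate choice here: if the axes rescaled on every frame, the animation would appear to jump rather than grow.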
Holtz, Y., 2020. Data Visualization With R And Ggplot2. [online] R-graph-gallery.com. Available at: https://www.r-graph-gallery.com/ggplot2-package.html.
Kanevsky, G., 2020. Facts About Coronavirus Disease 2019 (COVID-19) In 5 Charts Created With R And Ggplot2. [online] Novyden.blogspot.com. Available at: https://novyden.blogspot.com/2020/03/facts-about-coronavirus-disease-2019.html.
Kumar, P., 2020. Understanding Plot() Function In R. [online] JournalDev. Available at: https://www.journaldev.com/36083/plot-function-in-r.
Pedersen, T., 2020. Ggplot2 Package. [online] Rdocumentation.org. Available at: https://www.rdocumentation.org/packages/ggplot2/versions/3.3.2.
EDUCBA, 2020. Plot Function In R. [online] Available at: https://www.educba.com/plot-function-in-r/.
Qian, X., 2020. Visualize The Pandemic With R #COVID-19. [online] Medium. Available at: https://towardsdatascience.com/visualize-the-pandemic-with-r-covid-19-c3443de3b4e4.